{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "RE正则表达式学习"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "# - 常用操作符\n",
    "    - `.` 表示任何单个字符\n",
    "    - `[ ]` 字符集，对单个字符给出取值范围 ，如`[abc]`表示a、b、c，`[a‐z]`表示a到z单个字符\n",
    "    - `[^ ]` 非字符集，对单个字符给出排除范围 ，如`[^abc]`表示非a或b或c的单个字符\n",
    "    - `*` 前一个字符0次或无限次扩展，如abc* 表示 ab、abc、abcc、abccc等 \n",
    "    - `+` 前一个字符1次或无限次扩展 ，如abc+ 表示 abc、abcc、abccc等 \n",
    "    - `?` 前一个字符0次或1次扩展 ，如abc? 表示 ab、abc\n",
    "    - `|` 左右表达式任意一个 ，如abc|def 表示 abc、def\n",
    "\n",
    "    - `{m}` 扩展前一个字符m次 ，如ab{2}c表示abbc\n",
    "    - `{m,n}` 扩展前一个字符m至n次（含n） ，如ab{1,2}c表示abc、abbc\n",
    "    - `^` 匹配字符串开头 ，如^abc表示abc且在一个字符串的开头\n",
    "    - `$` 匹配字符串结尾 ，如abc$表示abc且在一个字符串的结尾\n",
    "    - `( )` 分组标记，内部只能使用 | 操作符 ，如(abc)表示abc，(abc|def)表示abc、def\n",
    "    - `\\d` 数字，等价于`[0‐9]`\n",
    "    - `\\w` 单词字符，等价于`[A‐Za‐z0‐9_]"
   ]
  },
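  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The operators listed above can be checked with a minimal sketch like the one below. It uses `re.fullmatch` from the standard library, which succeeds only when the whole string matches the pattern. The sample strings are made up for illustration: each pattern is paired with one string it should accept and one it should reject."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# (pattern, string that should match, string that should not); all samples are illustrative\n",
    "samples = [\n",
    "    (r'abc*', 'abccc', 'abd'),\n",
    "    (r'abc+', 'abcc', 'ab'),\n",
    "    (r'abc?', 'ab', 'abcc'),\n",
    "    (r'abc|def', 'def', 'abd'),\n",
    "    (r'ab{2}c', 'abbc', 'abc'),\n",
    "    (r'ab{1,2}c', 'abbc', 'abbbc'),\n",
    "    (r'[^abc]', 'd', 'a'),\n",
    "    (r'\\d\\w', '7_', '_7'),\n",
    "]\n",
    "for pattern, good, bad in samples:\n",
    "    # fullmatch returns a match object on success and None otherwise\n",
    "    assert re.fullmatch(pattern, good) is not None\n",
    "    assert re.fullmatch(pattern, bad) is None\n",
    "    print(f'{pattern!r}: matches {good!r}, rejects {bad!r}')"
   ]
  },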
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "默认re库为贪婪匹配"
   ]
  },
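  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small sketch of the difference between greedy and non-greedy matching, using a made-up sample string. The greedy pattern should return a single match that spans both tags, while the non-greedy one should return one match per tag."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "text = '<b>one</b><b>two</b>'  # made-up sample string\n",
    "greedy = re.findall(r'<b>.*</b>', text)  # .* extends to the last </b>\n",
    "lazy = re.findall(r'<b>.*?</b>', text)   # .*? stops at the first </b>\n",
    "print(greedy)\n",
    "print(lazy)"
   ]
  },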
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "以下例子没有解析出来内容 需要重新调试"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 导入包\n",
    "import requests\n",
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [],
   "source": [
    "def getHTMLText(url):\n",
    "    \"\"\"\n",
    "    请求获取html，（字符串）\n",
    "    :param url: 爬取网址\n",
    "    :return: 字符串\n",
    "    \"\"\"\n",
    "    try:\n",
    "        # 添加头信息,\n",
    "        kv = {\n",
    "            #'cookie': 'thw=cn; v=0; t=ab66dffdedcb481f77fd563809639584; cookie2=1f14e41c704ef58f8b66ff509d0d122e; _tb_token_=5e6bed8635536; cna=fGOnFZvieDECAXWIVi96eKju; unb=1864721683; sg=%E4%B8%8B3f; _l_g_=Ug%3D%3D; skt=83871ef3b7a49a0f; cookie1=BqeGegkL%2BLUif2jpoUcc6t6Ogy0RFtJuYXR4VHB7W0A%3D; csg=3f233d33; uc3=vt3=F8dBy3%2F50cpZbAursCI%3D&id2=UondEBnuqeCnfA%3D%3D&nk2=u%2F5wdRaOPk21wDx%2F&lg2=VFC%2FuZ9ayeYq2g%3D%3D; existShop=MTU2MjUyMzkyMw%3D%3D; tracknick=%5Cu4E36%5Cu541B%5Cu4E34%5Cu4E3F%5Cu5929%5Cu4E0B; lgc=%5Cu4E36%5Cu541B%5Cu4E34%5Cu4E3F%5Cu5929%5Cu4E0B; _cc_=WqG3DMC9EA%3D%3D; dnk=%5Cu4E36%5Cu541B%5Cu4E34%5Cu4E3F%5Cu5929%5Cu4E0B; _nk_=%5Cu4E36%5Cu541B%5Cu4E34%5Cu4E3F%5Cu5929%5Cu4E0B; cookie17=UondEBnuqeCnfA%3D%3D; tg=0; enc=2GbbFv3joWCJmxVZNFLPuxUUDA7QTpES2D5NF0D6T1EIvSUqKbx15CNrsn7nR9g%2Fz8gPUYbZEI95bhHG8M9pwA%3D%3D; hng=CN%7Czh-CN%7CCNY%7C156; mt=ci=32_1; alitrackid=www.taobao.com; lastalitrackid=www.taobao.com; swfstore=97213; x=e%3D1%26p%3D*%26s%3D0%26c%3D0%26f%3D0%26g%3D0%26t%3D0%26__ll%3D-1%26_ato%3D0; uc1=cookie16=UtASsssmPlP%2Ff1IHDsDaPRu%2BPw%3D%3D&cookie21=UIHiLt3xThH8t7YQouiW&cookie15=URm48syIIVrSKA%3D%3D&existShop=false&pas=0&cookie14=UoTaGqj%2FcX1yKw%3D%3D&tag=8&lng=zh_CN; JSESSIONID=A502D8EDDCE7B58F15F170380A767027; isg=BMnJJFqj8FrUHowu4yKyNXcd2PXjvpa98f4aQWs-RbDvsunEs2bNGLfj8BYE6lWA; l=cBTDZx2mqxnxDRr0BOCanurza77OSIRYYuPzaNbMi_5dd6T114_OkmrjfF96VjWdO2LB4G2npwJ9-etkZ1QoqpJRWkvP.; whl=-1%260%260%261562528831082',\n",
    "            'cookie':'miid=1111021262900328171; cna=SgWeFCHgyQ0CAbZq1I45tRzJ; tracknick=tb5780965_2012; tg=0; enc=Bj2L1JeDqxxMWnn2pRfpN4AnLoEvMiyhdIptVznJReZTFUbDVK720pVIX6A6EBe4dqxcrh6kOG1DSvHERyqEaw%3D%3D; x=e%3D1%26p%3D*%26s%3D0%26c%3D0%26f%3D0%26g%3D0%26t%3D0%26__ll%3D-1%26_ato%3D0; UM_distinctid=16e64d9de8ea1-0c61a90bb2c827-3a614f0b-100200-16e64d9de8ff; t=5994a8c7be75a168fe716edc02758221; thw=cn; _cc_=WqG3DMC9EA%3D%3D; JSESSIONID=610428E3CE9003C62CC01E4CE5E60CE2; cookie2=148141800b47efdef5387f2ac16fd531; _tb_token_=937136733e63; hng=CN%7Czh-CN%7CCNY%7C156; _samesite_flag_=true; CNZZDATA1258427669=710665545-1587636331-%7C1587636331; l=eBIWqfR7vMbyV819BOfaFurza779tIR46uPzaNbMiT5P_pCw5ZsFWZjX1EYeCnGVHsCXS3RA-8MpBmTFqyFq0-Y3L3k_J_DmndC..; isg=BKqqBreRgMQ4cQ4w6UMpkks1-xBMGy51neDg4jRjFP2OZ0ghGqqghbEV85P7yKYN',\n",
    "            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'\n",
    "        }#什么时候需要加cookie？ 通常是需要登陆才能浏览的网页\n",
    "        r = requests.get(url, timeout=30, headers=kv)\n",
    "        # r = requests.get(url, timeout=30)\n",
    "        # print(r.status_code)\n",
    "        r.raise_for_status()#返回状态码\n",
    "        #r.encoding = r.apparent_encoding\n",
    "        return r.text\n",
    "    except:\n",
    "        return \"爬取失败\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [],
   "source": [
    "def parsePage(glist, html):\n",
    "    '''\n",
    "    解析网页，搜索需要的信息\n",
    "    :param glist: 列表作为存储容器\n",
    "    :param html: 由getHTMLText()得到的\n",
    "    :return: 商品信息的列表\n",
    "    '''\n",
    "    try:\n",
    "        # 使用正则表达式提取信息\n",
    "        price_list = re.findall(r'\\\"view_price\\\"\\:\\\"[\\d\\.]*\\\"', html)\n",
    "        name_list = re.findall(r'\\\"raw_title\\\"\\:\\\".*?\\\"', html)\n",
    "        for i in range(len(price_list)):\n",
    "            price = eval(price_list[i].split(\":\")[1])  #eval（）在此可以去掉\"\"\n",
    "            name = eval(name_list[i].split(\":\")[1])\n",
    "            glist.append([price, name])\n",
    "    except:\n",
    "        print(\"解析失败\")"
   ]
  },
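  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The parsing step above relies on `eval()` to strip the quotes from each matched value. One possible alternative, sketched here under the assumption that the matched values are valid JSON string literals, is to capture the quoted value with a regex group and decode it with `json.loads`. The `sample` string is a hypothetical fragment written for this example, not real page data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import re\n",
    "\n",
    "# Hypothetical page fragment; the first title contains a colon on purpose\n",
    "sample = '\"raw_title\":\"Laptop bag: large\",\"view_price\":\"59.00\",\"raw_title\":\"School bag\",\"view_price\":\"138.00\"'\n",
    "\n",
    "# The parentheses capture only the quoted value, so no split on ':' is needed\n",
    "titles = re.findall(r'\"raw_title\":(\".*?\")', sample)\n",
    "prices = re.findall(r'\"view_price\":(\"[\\d.]*\")', sample)\n",
    "\n",
    "# json.loads decodes a JSON string literal without executing any code\n",
    "goods = [[json.loads(p), json.loads(t)] for p, t in zip(prices, titles)]\n",
    "print(goods)"
   ]
  },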
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [],
   "source": [
    "def printGoodList(glist):\n",
    "    tplt = \"{0:^4}\\t{1:^6}\\t{2:^10}\" #控制输出格式\n",
    "    print(tplt.format(\"序号\", \"商品价格\", \"商品名称\"))\n",
    "    count = 0\n",
    "    for g in glist:\n",
    "        count = count + 1\n",
    "        print(tplt.format(count, g[0], g[1]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 根据页面url的变化寻找规律，构建爬取url\n",
    "goods_name = \"书包\"  # 搜索商品类型\n",
    "start_url = \"https://s.taobao.com/search?q=\" + goods_name\n",
    "info_list = []\n",
    "page = 3  # 爬取页面数量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "爬取页面当前进度: 100.00%"
     ]
    }
   ],
   "source": [
    "count = 0\n",
    "for i in range(page):\n",
    "    count += 1\n",
    "    try:\n",
    "        url = start_url + \"&s=\" + str(44 * i)\n",
    "        html = getHTMLText(url)  # 爬取url\n",
    "        #print(html)\n",
    "        parsePage(info_list, html) #解析HTML和爬取内容\n",
    "        print(\"\\r爬取页面当前进度: {:.2f}%\".format(count * 100 / page), end=\"\")  # 显示进度条\n",
    "    except:\n",
    "        continue"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 序号 \t 商品价格 \t   商品名称   \n",
      " 1  \t138.00\t电脑包大书包大学生女生背包大双肩包大容量\n",
      " 2  \t59.00 \t双肩包男士大容量旅行电脑背包时尚潮流高中初中学生书包女大学生\n",
      " 3  \t149.00\t鳄鱼男士双肩包商务休闲电脑帆布背包旅游旅行包时尚潮流学生书包\n",
      " 4  \t149.00\t花花公子男士背包2020年新款商务电脑双肩包高中学生大容量书包\n",
      " 5  \t279.00\tJordan 官方 AIR JORDAN 双肩包 书包背包\n",
      " CW7699\n",
      " 6  \t288.00\t【新品】JanSport杰斯伯双肩包女学生书包男背包运动休闲背包4QUT\n",
      " 7  \t669.00\tkipling女士帆布背包2020年新款时尚简约休闲潮流书包双肩包|ROSE\n",
      " 8  \t139.00\t花花公子男士双肩包时尚潮流个性大学生书包休闲旅行电脑迷彩背包\n",
      " 9  \t689.00\t背包双肩包男士商务旅行背包防盗电脑包休闲书包男多功能大旅游包\n",
      " 10 \t99.90 \t迪卡侬户外双肩背包男女休闲双肩包运动学生小书包轻便10L QUBP\n",
      " 11 \t998.00\tHerschel Little America经典色旅游双肩包男女士背包书包10020\n",
      " 12 \t255.00\tViney双肩包包2020新款潮真皮时尚背包女大容量书包韩版百搭女包\n",
      " 13 \t85.00 \tIT男程序员周边青年学生电脑背包书包双肩包电脑包14-16寸usb充电\n",
      " 14 \t149.00\tPUMA彪马背包2020新款女包双肩包拎包休闲小包PU小书包076960-02\n",
      " 15 \t899.00\t【买1送1】Fion/菲安妮大牌双肩包女 2020新款印花背包旅行书包\n",
      " 16 \t479.00\tFjallraven/瑞典北极狐双肩包kanken书包女电脑旅行背包官方23510\n",
      " 17 \t358.00\tJanSport旗舰店官网女双肩背包学生书包电脑包男背包 3P6X008\n",
      " 18 \t219.00\tAdidas阿迪达斯书包女2020新款粉色运动背包初高中学生双肩包女包\n",
      " 19 \t249.90\t迪卡侬旗舰店新款大容量双肩背包户外运动防水书包休闲男女TRD\n",
      " 20 \t1098.00\tHerschel Dawson大号时尚双肩包 Surplus系列休闲背包书包10649\n",
      " 21 \t189.00\t阿迪达斯双肩包男女2020新款初中生高中学生书包大容量背包DT8638\n",
      " 22 \t179.00\t安踏双肩包2020潮牌大容量旅行背包男休闲简约学生书包运动背包男\n",
      " 23 \t219.00\tNIKE耐克双肩包男包女包2020春季新款户外运动包学生书包旅行背包\n",
      " 24 \t428.00\tTeenmix/天美意2020春新款商场同款外出纯色校园双肩书包X1900AX0\n",
      " 25 \t319.00\tNIKE耐克双肩包2020夏季新款男包女包学生书包运动包背包潮BA6097\n",
      " 26 \t139.00\t阿迪达斯学生书包男女包初中高中大学生电脑包运动双肩背包FI7968\n",
      " 27 \t199.00\t瑞士军士刀双肩包男大容量休闲商务旅行电脑背包男士初中学生书包\n",
      " 28 \t139.00\t牛津布双肩包女2020新款韩版时尚百搭简约旅行防盗背包帆布书包潮\n",
      " 29 \t119.00\t特步男女双肩包2020夏季新款大容量书包百搭潮流男士女士运动背包\n",
      " 30 \t69.00 \t牛津布双肩包女2020新款潮韩版时尚百搭大学生书包旅行帆布小背包\n",
      " 31 \t149.00\t特步男女双肩包2020春季新款综训背包舒适简约纯色书包男运动背包\n",
      " 32 \t129.00\tuek小学生书包男孩女生一三五 六年级护脊双肩6-12岁轻便减压儿童\n",
      " 33 \t438.00\ttigerfamily小学生书包1-3年级男女孩儿童书包减负护脊背包6周岁\n",
      " 34 \t299.00\t【直营】Adidas双肩包男女CL AOP运动休闲舒适学生书包背包FM6896\n",
      " 35 \t498.00\tTiger Family护脊减负书包 小学生3-5年级儿童女12周岁男童背包\n",
      " 36 \t179.00\tPUMA彪马双肩包男包女包2020新款运动包学生书包潮休闲包旅行背包\n",
      " 37 \t178.00\t迪士尼小学生书包女童1-3-4一三年级冰雪奇缘女孩减负儿童双肩包6\n",
      " 38 \t175.00\t不莱玫迪士尼合作款书包女韩版高中百搭ins双肩包时尚可爱小背包\n",
      " 39 \t970.00\tGaston Luga瑞典潮牌背包男双肩包女大容量旅行包休闲书包电脑包\n",
      " 40 \t218.00\t【直营】Puma彪马女包双肩包运动包学生书包休闲包背包076944-02\n",
      " 41 \t419.00\tFjallraven/北极狐双肩包kanken mini 迷你情侣书包背包女23561\n",
      " 42 \t115.00\tkk树书包小学生女孩6-12岁儿童一二三到六年级女童双肩包护脊减负\n",
      " 43 \t219.00\tPUMA彪马双肩包男包女包2019新款运动包休闲背包学生书包074706\n",
      " 44 \t268.00\tBOPAI博牌电脑背包男户外旅行休闲双肩包商务书包出差多功能男包\n",
      " 45 \t869.00\tkipling男女大容量电脑包2020新款时尚书包旅行包双肩包|SO BABY\n",
      " 46 \t69.90 \t大脸兔牛津布双肩包女2020新款韩版尼龙百搭旅行防水超轻背包书包\n",
      " 47 \t899.00\t挪威官方正品Beckmann小学生书包女男儿童护脊减压背包1-3年级\n",
      " 48 \t175.00\t不莱玫迪士尼米奇双肩包新款韩版高中复古背包大容量学生帆布书包\n",
      " 49 \t899.00\t挪威官方正品Beckmann小学生书包女男儿童护脊减压背包1-3年级\n",
      " 50 \t219.00\tPUMA彪马双肩包男包女包2019新款运动包休闲背包学生书包074706\n",
      " 51 \t115.00\tkk树书包小学生女孩6-12岁儿童一二三到六年级女童双肩包护脊减负\n",
      " 52 \t268.00\tBOPAI博牌电脑背包男户外旅行休闲双肩包商务书包出差多功能男包\n",
      " 53 \t129.00\t小米双肩包书包男女笔记本电脑包时尚潮流旅行背包\n",
      " 54 \t259.00\tHype双肩包少女渐变小清新背包简约时尚百搭ins风潮牌大学生书包\n",
      " 55 \t129.00\t七匹狼双肩包男大容量背包书包新款超大商务休闲旅行笔记本电脑包\n",
      " 56 \t499.00\t日本进口卡芙露书包小学生1-3年级6儿童轻便减负护脊男女双肩背包\n",
      " 57 \t408.00\tFILA斐乐小学生书包大容量男女童背包2020春新款儿童双肩包3M反光\n",
      " 58 \t139.00\t花花公子男士双肩包时尚潮流休闲初中学生书包大学生电脑旅行背包\n",
      " 59 \t89.00 \t花花公子双肩包女2020年新款百搭大学生背包韩版初中高中学生书包\n",
      " 60 \t145.00\t不莱玫迪士尼合作款双肩包女韩版百搭可爱小书包ins潮酷旅行背包\n",
      " 61 \t698.00\tHerschel Retreat春夏新色旅游双肩包男女士书包背包百搭10066\n",
      " 62 \t288.00\t【新品】JanSport杰斯伯双肩包女学生书包电脑包休闲背包4QUT5L8\n",
      " 63 \t498.00\tHerschel City中号校园双肩包男书包背包潮牌女 ins 百搭10486\n",
      " 64 \t358.00\tJanSport旗舰店官网双肩背包女学生书包电脑包男背包 3P6X04V\n",
      " 65 \t149.00\tPUMA彪马官网旗舰双肩包男包女包2020新款初中高中学生书包电脑包\n",
      " 66 \t378.00\ttigerfamily儿童书包小学生一年级1-3 女男6岁耐磨减负护脊双肩包\n",
      " 67 \t149.00\tPUMA彪马官网正品双肩包背包初中高中学生书包旅游包休闲运动包潮\n",
      " 68 \t289.00\tViney双肩包女韩版百搭ins原宿大容量百搭背包书包时尚简约双肩包\n",
      " 69 \t589.00\tkipling女士多背法背包2020年新款时尚潮简约书包双肩包|IVES系列\n",
      " 70 \t698.00\tkipling女大容量背包春夏新品时尚简约潮流休闲书包双肩包|MATTA\n",
      " 71 \t1588.00\t【GPS定位】英国AnythingStudio小学生书包 儿童女进口英伦日本风\n",
      " 72 \t399.00\tFILA斐乐童装旗舰店儿童双肩包小学生书包男童女童低年级背包新款\n",
      " 73 \t226.00\t真皮双肩包女2020年新款书包女百搭大容量头层牛皮女士软皮背包潮\n",
      " 74 \t699.00\t北极狐laptop笔记本电脑包13/15/17英寸男女手提双肩背包学生书包\n",
      " 75 \t229.00\t迪士尼商店 冰雪奇缘艾莎公主小学生书包儿童书包双肩包女童书包\n",
      " 76 \t188.00\t优仅ALLJOINT儿童书包可爱幼儿园双肩甜甜圈彩虹幼儿背包女童包包\n",
      " 77 \t218.00\t【直营】Puma彪马女包双肩包运动包学生书包休闲包背包076944-02\n",
      " 78 \t488.00\t香港tigerfamily小学生护脊书包 男女5-9年级初中学生减负双肩包\n",
      " 79 \t389.00\t朱丹推荐诺狐书包小学生女孩一二三到六年级护脊减负儿童双肩背包\n",
      " 80 \t499.00\tFION/菲安妮新款双肩包旅行包 女士印花背包青年防水名牌书包小包\n",
      " 81 \t479.00\tFjallraven/北极狐书包kanken双肩包女户外包运动背包男23510\n",
      " 82 \t998.00\tFion/菲安妮休闲双肩包潮流学生书包 2020新款女包尼龙黑色旅行包\n",
      " 83 \t2598.00\t【亚洲限定款】天使之翼SEIBAN 日本保税护脊减负小学生粉色书包\n",
      " 84 \t188.00\t迪士尼拉杆书包小学生女童3-6年级公主3轮爬楼女孩两用儿童双肩包\n",
      " 85 \t398.00\t安踏中国英雄双肩包潮牌街头嘻哈情侣双肩包男女时尚潮流书包背包\n",
      " 86 \t159.00\t安踏背包2020春季新款运动户外时尚男旅行包防水学生书包双肩包\n",
      " 87 \t1169.00\tergobag德国儿童减负护脊护肩书包中小学生书包男女1-5年级\n",
      " 88 \t589.00\t双肩包男士背包商务休闲旅行背包防盗旅游包女大中学生书包电脑包\n",
      " 89 \t219.00\tHype双肩包男女背包2020新款韩版时尚百搭ins高中校园大学生书包\n",
      " 90 \t899.00\t[2020新款]挪威Beckmann小学生书包女男儿童护脊减压背包1-3年级\n",
      " 91 \t98.00 \t迪士尼小学生书包女童1-3-4三四年级冰雪奇缘女孩儿童减负双肩包6\n",
      " 92 \t199.00\t小米双肩包商务旅行背包大容量书包男士时尚多功能笔记本电脑包\n",
      " 93 \t478.00\tHerschel City Offset 中号旅游双肩包男女背包书包潮牌10486\n",
      " 94 \t598.00\tHerschel Pop Quiz 时尚潮流双肩包男女背包书包大容量10011\n",
      " 95 \t998.00\tHerschel Buckingham 双肩包 休闲背包 大容量潮包 书包男10509\n",
      " 96 \t528.00\tHerschel Nova小号时尚潮流校园双肩包女小包书包背包百搭10502\n",
      " 97 \t628.00\tHerschel Nova中号旅游双肩包女2019新款学生背包书包ins10503\n",
      " 98 \t398.00\tHerschel City中号限量款双肩包女2019新款背包男书包时尚10486\n",
      " 99 \t498.00\tHerschel Grove迷你双肩包女迷小包休闲书包背包时尚百搭10261\n",
      "100 \t248.00\tHerschel Settlement 简约百搭双肩包男女生背包书包时尚10005\n",
      "101 \t598.00\tHerschel Dawson 双肩包男书包女ins风潮牌街头背包欧美10233\n",
      "102 \t178.00\tHerschel Daypack 帆布系列双肩包女 背包 休闲书包百搭10076\n",
      "103 \t698.00\tHerschel Little America 百搭旅游双肩包男女背包大容量10014\n",
      "104 \t448.00\tHerschel Retreat 时尚潮流旅游男女双肩包书包背包百搭10066\n",
      "105 \t89.00 \t特步男女双肩包书包2020春季新款运动背包旅游包耐用时尚舒适简约\n",
      "106 \t109.00\t特步双肩包男背包2020春季新款舒适双肩包女运动背包旅游包休闲包\n",
      "107 \t69.00 \t特步男女双肩包2020春季新款男包女包简约舒适条纹拼接旅游包书包\n",
      "108 \t99.00 \t特步男女双肩包2020春季新款几何线条简约男包女包书包旅游包休闲\n",
      "109 \t99.00 \t特步男女双肩包2020春季新品舒适书包简约运动包旅行包休闲包背包\n",
      "110 \t119.00\t特步官方旗舰店2020春季新款男女双肩包书包运动背包旅游包休闲包\n",
      "111 \t89.00 \t特步男女双肩包2020春季新款运动背包书包旅游包时尚简约休闲背包\n",
      "112 \t19.80 \t小米炫彩小背包胸包休闲轻便学生书包户外旅行双肩包男女简约背包\n",
      "113 \t62.90 \t森马双肩包女新款时尚字母休闲学生背包男潮牌书包男韩版高中\n",
      "114 \t448.00\tHerschel Dawson Offset双肩包男学生潮牌书包背包街头欧美10233\n",
      "115 \t248.00\tHerschel Heritage 5-7岁儿童双肩包 时尚小书包背包10312\n",
      "116 \t698.00\tHerschel Retreat 轻量版双肩包男背包时尚休闲书包女潮牌10626\n",
      "117 \t498.00\tHerschel Heritage 秋冬新色双肩包男背包男学生书包10019\n",
      "118 \t109.00\t特步男女双肩包2020春季新款书包登山包背包运动情侣休闲学院风包\n",
      "119 \t498.00\tHerschel Settlement 时尚男女双肩包 休闲背包 书包10005\n",
      "120 \t498.00\tHerschelSupply Parker双肩包男 学生书包 潮流背包登山包10264\n",
      "121 \t698.00\tHerschel Harrison双肩包男 潮流背包 学生双肩书包10325\n",
      "122 \t598.00\tHerschel Thompson 轻量版双肩包 休闲背包 时尚书包10619\n",
      "123 \t498.00\tHerschel x Santa Cruz 联名款儿童双肩包 Heritage书包背包10312\n",
      "124 \t598.00\tHerschel Heritage 5-7岁儿童双肩包反光背包 书包10312\n",
      "125 \t748.00\tHerschel Little America 5岁以上儿童双肩包 书包10589\n",
      "126 \t298.00\tHerschel Retreat 8岁以上儿童双肩包 背包 学生书包10248\n",
      "127 \t498.00\tHerschel Heritage3-4岁儿童双肩包反光背包出游书包10313\n",
      "128 \t748.00\tHerschel Winlaw Studio系列双肩包男 背包 学生书包10189\n",
      "129 \t498.00\tHerschel Heritage 轻量版双肩包 时尚休闲背包 书包10623\n",
      "130 \t69.00 \t特步女子双肩包运动背包2020春季新款纯色贝壳书包都市休闲时尚包\n",
      "131 \t79.00 \t特步男女双肩包2020春季新款时尚运动潮流字母男女生书包双肩背包\n",
      "132 \t99.90 \t森马双肩包女新款韩版多口袋休闲旅行背包ins少女书包高中生\n",
      "133 \t178.00\tHerschel Post 中号双肩包女 潮流背包 学生双肩书包10021\n",
      "134 \t628.00\tHerschel Nova轻量版双肩包女迷小包轻便背包可爱书包10640\n",
      "135 \t598.00\tHerschel Pop Quiz 轻量版双肩包 休闲背包 潮包书包10625\n",
      "136 \t248.00\tHerschel Packable Daypack双肩包男可折叠背包女书包简约10076\n"
     ]
    }
   ],
   "source": [
    "printGoodList(info_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "XPath使用\n",
    "通常与lxml搭配使用"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Element html at 0x1d6d208b7c8>"
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from lxml import etree\n",
    "import requests\n",
    "url = \"http://www.dxy.cn/bbs/thread/626626#626626\"\n",
    "req = requests.get(url) \n",
    "html = req.text\n",
    "tree = etree.HTML(html) \n",
    "tree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "metadata": {},
   "outputs": [],
   "source": [
    "#提取出来的为列表形式 content 提取错误\n",
    "user = tree.xpath('//div[@class=\"auth\"]/a/text()')\n",
    "# print(user)\n",
    "content = tree.xpath('//td[@class=\"postbody\"]')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[<Element td at 0x1d6d2083088>, <Element td at 0x1d6d3141688>, <Element td at 0x1d6d3141388>, <Element td at 0x1d6d31411c8>]\n"
     ]
    }
   ],
   "source": [
    "print(content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = []\n",
    "for i in range(4):\n",
    "    results.append(user[i] +\":\"+content[i].strip())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['楼医生:我遇到一个“怪”病人，向大家请教。她，42岁。反复惊吓后晕厥30余年。每次受响声惊吓后发生跌倒，短暂意识丧失。无逆行性遗忘，无抽搐，无口吐白沫，无大小便失禁。多次跌倒致外伤。婴儿时有惊厥史。入院查体无殊。ECG、24小时动态心电图无殊；头颅MRI示小软化灶；脑电图无殊。入院后有数次类似发作。请问该患者该做何诊断，还需做什么检查，治疗方案怎样？',\n",
       " 'lion000:从发作的症状上比较符合血管迷走神经性晕厥，直立倾斜试验能协助诊断。在行直立倾斜实验前应该做常规的体格检查、ECG、UCG、holter和X-ray胸片除外器质性心脏病。',\n",
       " 'xghrh:贴一篇“口服氨酰心安和依那普利治疗血管迷走性晕厥的疗效观察”',\n",
       " 'keys:作者：林文华 任自文 丁燕生']"
      ]
     },
     "execution_count": 100,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {},
   "outputs": [],
   "source": [
    "user = tree.xpath('//div[@class=\"auth\"]/a/text()')\n",
    "# print(user)\n",
    "content = tree.xpath('//td[@class=\"postbody\"]')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Element td at 0x1d6d2083088>,\n",
       " <Element td at 0x1d6d3141688>,\n",
       " <Element td at 0x1d6d3141388>,\n",
       " <Element td at 0x1d6d31411c8>]"
      ]
     },
     "execution_count": 106,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#content还需要进一步解析\n",
    "content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = []\n",
    "for i in range(0, len(user)):\n",
    "    # print(user[i].strip()+\":\"+content[i].xpath('string(.)').strip())\n",
    "    # print(\"*\"*80)\n",
    "    # 因为回复内容中有换行等标签，所以需要用string()来获取数据\n",
    "    results.append(user[i].strip() + \":  \" + content[i].xpath('string()').strip())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['楼医生:  我遇到一个“怪”病人，向大家请教。她，42岁。反复惊吓后晕厥30余年。每次受响声惊吓后发生跌倒，短暂意识丧失。无逆行性遗忘，无抽搐，无口吐白沫，无大小便失禁。多次跌倒致外伤。婴儿时有惊厥史。入院查体无殊。ECG、24小时动态心电图无殊；头颅MRI示小软化灶；脑电图无殊。入院后有数次类似发作。请问该患者该做何诊断，还需做什么检查，治疗方案怎样？',\n",
       " 'lion000:  从发作的症状上比较符合血管迷走神经性晕厥，直立倾斜试验能协助诊断。在行直立倾斜实验前应该做常规的体格检查、ECG、UCG、holter和X-ray胸片除外器质性心脏病。贴一篇“口服氨酰心安和依那普利治疗血管迷走性晕厥的疗效观察”作者：林文华 任自文 丁燕生http://www.ccheart.com.cn/ccheart_site/Templates/jieru/200011/1-1.htm',\n",
       " 'xghrh:  同意lion000版主的观点：如果此患者随着年龄的增长，其发作频率逐渐减少且更加支持，不知此患者有无这一特点。入院后的HOLTER及血压监测对此患者只能是一种安慰性的检查，因在这些检查过程中患者发病的机会不是太大，当然不排除正好发作的情况。对此患者应常规作直立倾斜试验，如果没有诱发出，再考虑有无可能是其他原因所致的意识障碍，如室性心动过速等，但这需要电生理尤其是心腔内电生理的检查，毕竟是有一种创伤性方法。因在外地，下面一篇文章可能对您有助，请您自己查找一下。心理应激事件诱发血管迷走性晕厥1例 ，杨峻青、吴沃栋、张瑞云，中国神经精神疾病杂志， 2002 Vol.28 No.2',\n",
       " 'keys:  该例不排除精神因素导致的，因为每次均在受惊吓后出现。当然，在作出此诊断前，应完善相关检查，如头颅MIR(MRA),直立倾斜试验等。']"
      ]
     },
     "execution_count": 115,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results"
   ]
  },
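  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see why `string()` helps when a post contains line-break tags, here is a minimal sketch on a made-up HTML snippet (the markup is an assumption for illustration, not the forum's real structure). `text()` should return the element's direct text nodes as separate list items, split at the `<br/>`, while `string()` should return all of the element's text joined into one string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from lxml import etree\n",
    "\n",
    "# Hypothetical markup that mimics a post body containing a <br/> tag\n",
    "snippet = '<div class=\"postbody\">first line<br/>second line</div>'\n",
    "doc = etree.HTML(snippet)\n",
    "post = doc.xpath('//div[@class=\"postbody\"]')[0]\n",
    "\n",
    "print(post.xpath('text()'))    # list of separate text nodes\n",
    "print(post.xpath('string()'))  # one string with all the text"
   ]
  },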
  {
   "cell_type": "code",
   "execution_count": 123,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "user1-楼医生:  我遇到一个“怪”病人，向大家请教。她，42岁。反复惊吓后晕厥30余年。每次受响声惊吓后发生跌倒，短暂意识丧失。无逆行性遗忘，无抽搐，无口吐白沫，无大小便失禁。多次跌倒致外伤。婴儿时有惊厥史。入院查体无殊。ECG、24小时动态心电图无殊；头颅MRI示小软化灶；脑电图无殊。入院后有数次类似发作。请问该患者该做何诊断，还需做什么检查，治疗方案怎样？\n",
      "****************************************************************************************************\n",
      "user2-lion000:  从发作的症状上比较符合血管迷走神经性晕厥，直立倾斜试验能协助诊断。在行直立倾斜实验前应该做常规的体格检查、ECG、UCG、holter和X-ray胸片除外器质性心脏病。贴一篇“口服氨酰心安和依那普利治疗血管迷走性晕厥的疗效观察”作者：林文华 任自文 丁燕生http://www.ccheart.com.cn/ccheart_site/Templates/jieru/200011/1-1.htm\n",
      "****************************************************************************************************\n",
      "user3-xghrh:  同意lion000版主的观点：如果此患者随着年龄的增长，其发作频率逐渐减少且更加支持，不知此患者有无这一特点。入院后的HOLTER及血压监测对此患者只能是一种安慰性的检查，因在这些检查过程中患者发病的机会不是太大，当然不排除正好发作的情况。对此患者应常规作直立倾斜试验，如果没有诱发出，再考虑有无可能是其他原因所致的意识障碍，如室性心动过速等，但这需要电生理尤其是心腔内电生理的检查，毕竟是有一种创伤性方法。因在外地，下面一篇文章可能对您有助，请您自己查找一下。心理应激事件诱发血管迷走性晕厥1例 ，杨峻青、吴沃栋、张瑞云，中国神经精神疾病杂志， 2002 Vol.28 No.2\n",
      "****************************************************************************************************\n",
      "user4-keys:  该例不排除精神因素导致的，因为每次均在受惊吓后出现。当然，在作出此诊断前，应完善相关检查，如头颅MIR(MRA),直立倾斜试验等。\n",
      "****************************************************************************************************\n"
     ]
    }
   ],
   "source": [
    "for i,result in zip(range(4),results):\n",
    "    print(\"user\"+ str(i+1) + \"-\" + result)\n",
    "    print(\"*\"*100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bs4下次打卡补上"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
